In [1]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
In [2]:
!pip install h2o
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting h2o
  Downloading h2o-3.40.0.2.tar.gz (177.6 MB)
Building wheels for collected packages: h2o
Successfully built h2o
Installing collected packages: h2o
Successfully installed h2o-3.40.0.2
In [3]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import data
f = "/content/drive/MyDrive/Qualities in Intelligent Systems/Bike-Sharing-Dataset/day.csv"
df = h2o.import_file(f)

# Response column
y = "cnt"

# Split into train & test
splits = df.split_frame(ratios=[0.8], seed=1)
train = splits[0]
test = splits[1]

# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)

# Explain leader model & compare with all AutoML models
exa = aml.explain(test)

# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)
Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: openjdk version "11.0.18" 2023-01-17; OpenJDK Runtime Environment (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1); OpenJDK 64-Bit Server VM (build 11.0.18+10-post-Ubuntu-0ubuntu120.04.1, mixed mode, sharing)
  Starting server from /usr/local/lib/python3.9/dist-packages/h2o/backend/bin/h2o.jar
  Ice root: /tmp/tmp7fslgtys
  JVM stdout: /tmp/tmp7fslgtys/h2o_unknownUser_started_from_python.out
  JVM stderr: /tmp/tmp7fslgtys/h2o_unknownUser_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O_cluster_uptime: 02 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.40.0.2
H2O_cluster_version_age: 7 days, 23 hours and 54 minutes
H2O_cluster_name: H2O_from_python_unknownUser_kavmgw
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.172 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://127.0.0.1:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.16 final
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%

Leaderboard

The leaderboard shows models with their metrics. When provided with an H2OAutoML object, it shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the supplied frame. At most 20 models are shown by default.
model_id                                                   rmse       mse          mae       rmsle       mean_residual_deviance    training_time_ms    predict_time_per_row_ms    algo
StackedEnsemble_BestOfFamily_3_AutoML_1_20230317_151130    98.106     9624.79      67.2775   0.0329657   9624.79                   240                 0.232783                   StackedEnsemble
StackedEnsemble_BestOfFamily_2_AutoML_1_20230317_151130    98.106     9624.79      67.2775   0.0329657   9624.79                   288                 0.069139                   StackedEnsemble
StackedEnsemble_AllModels_1_AutoML_1_20230317_151130       99.3945    9879.27      67.246    0.0311356   9879.27                   566                 0.18038                    StackedEnsemble
StackedEnsemble_AllModels_2_AutoML_1_20230317_151130       99.9931    9998.62      67.6545   0.0312209   9998.62                   283                 0.167626                   StackedEnsemble
GBM_3_AutoML_1_20230317_151130                             103.66     10745.5      69.3182   0.0348153   10745.5                   642                 0.034323                   GBM
GBM_4_AutoML_1_20230317_151130                             143.498    20591.7      95.6842   0.0507063   20591.7                   993                 0.044029                   GBM
StackedEnsemble_BestOfFamily_1_AutoML_1_20230317_151130    158.688    25182        107.911   0.0539144   25182                     774                 0.044236                   StackedEnsemble
XGBoost_1_AutoML_1_20230317_151130                         166.693    27786.6      117.504   0.0562133   27786.6                   618                 0.014036                   XGBoost
GBM_2_AutoML_1_20230317_151130                             177.254    31418.8      120.446   0.0624025   31418.8                   1162                0.027619                   GBM
XGBoost_grid_1_AutoML_1_20230317_151130_model_1            192.431    37029.7      147.774   0.0930426   37029.7                   251                 0.013005                   XGBoost
GBM_5_AutoML_1_20230317_151130                             202.543    41023.8      131.344   0.070276    41023.8                   324                 0.023635                   GBM
XGBoost_2_AutoML_1_20230317_151130                         206.521    42651        155.068   0.0683256   42651                     659                 0.012363                   XGBoost
XGBoost_grid_1_AutoML_1_20230317_151130_model_2            217.912    47485.4      168.958   0.0660485   47485.4                   109                 0.010368                   XGBoost
DRF_1_AutoML_1_20230317_151130                             250.339    62669.4      176.565   0.0969177   62669.4                   801                 0.01938                    DRF
XRT_1_AutoML_1_20230317_151130                             298.545    89129.4      203.117   0.111638    89129.4                   300                 0.008238                   DRF
GBM_1_AutoML_1_20230317_151130                             318.378    101364       243.11    0.135626    101364                    1574                0.020177                   GBM
GLM_1_AutoML_1_20230317_151130                             344.39     118604       257.38    0.129314    118604                    100                 0.004387                   GLM
GBM_grid_1_AutoML_1_20230317_151130_model_1                380.558    144824       260.046   0.141757    144824                    251                 0.019313                   GBM
DeepLearning_1_AutoML_1_20230317_151130                    396.334    157080       292.219   0.157206    157080                    106                 0.010369                   DeepLearning
XGBoost_3_AutoML_1_20230317_151130                         1228.5     1.50921e+06  1087.23   0.302479    1.50921e+06               687                 0.005119                   XGBoost
[20 rows x 9 columns]

Residual Analysis

Residual analysis plots the fitted values versus the residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, or not accounting for heteroscedasticity, autocorrelation, etc. Note that "striped" lines of residuals are an artifact of an integer-valued (rather than real-valued) response variable.
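
The quantity behind this plot can be sketched with scikit-learn on synthetic data (a stand-in for the H2O leader model and the bike-sharing test frame, not the H2O implementation itself):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = GradientBoostingRegressor(random_state=1).fit(X_train, y_train)
fitted = model.predict(X_test)    # x-axis of the residual plot
residuals = y_test - fitted       # y-axis of the residual plot

# Well-behaved residuals are centred near zero with no visible structure.
```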

Learning Curve Plot

The learning curve plot shows the loss function or metric as a function of the number of iterations (or, for tree-based algorithms, the number of trees). This plot can be useful for determining whether the model overfits.
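
A hedged sketch of such a curve using scikit-learn's `staged_predict` (one prediction per boosting iteration) in place of H2O's plot:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GradientBoostingRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Test-set error after each boosting iteration: the learning curve.
curve = [mean_squared_error(y_test, pred) for pred in model.staged_predict(X_test)]
# A curve that bottoms out and then rises again indicates overfitting.
```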

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.
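
The score behind this plot can be sketched with a scikit-learn model (H2O computes its own importances, but the idea is the same):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, n_informative=2, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

# One non-negative score per feature; sorting them ranks the variables.
importances = model.feature_importances_
ranked = sorted(enumerate(importances), key=lambda kv: kv[1], reverse=True)
```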

Variable Importance Heatmap

The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, we compute a summary of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
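
The one-hot summarization described above can be sketched in plain Python; the column names and scores below are invented for illustration:

```python
# Hypothetical importances for one-hot encoded columns of a categorical
# feature "weathersit" plus two numeric features (all values invented).
onehot_importance = {
    "temp": 0.40,
    "weathersit.1": 0.05,
    "weathersit.2": 0.12,
    "weathersit.3": 0.03,
    "hum": 0.25,
}

# Sum the dummy-column scores back into one score per original column.
aggregated = {}
for name, score in onehot_importance.items():
    base = name.split(".")[0]  # strip the one-hot suffix
    aggregated[base] = aggregated.get(base, 0.0) + score
```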

Model Correlation

This plot shows the correlation between the predictions of the models. For classification, the frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit, are highlighted using red-colored text.
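
One entry of that matrix is just the Pearson correlation of two models' predictions on the same test frame; a sketch with scikit-learn stand-ins for the H2O models:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

pred_glm = LinearRegression().fit(X_train, y_train).predict(X_test)
pred_gbm = GradientBoostingRegressor(random_state=2).fit(X_train, y_train).predict(X_test)

# One entry (i, j) of the model-correlation matrix.
corr = np.corrcoef(pred_glm, pred_gbm)[0, 1]
```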

SHAP Summary

The SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term equals the raw prediction of the model, i.e., the prediction before the inverse link function is applied.
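
The additivity property stated above, illustrated with hypothetical numbers (not values from the trained model):

```python
# Hypothetical per-row SHAP contributions and bias term (numbers invented).
contributions = {"temp": 310.0, "hum": -45.0, "windspeed": -15.0}
bias = 4500.0  # average raw model output over the training data

# Additivity: contributions plus bias reproduce the raw prediction.
raw_prediction = bias + sum(contributions.values())
```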

Partial Dependence Plots

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. The PDP assumes independence between the feature for which it is computed and the rest.

Individual Conditional Expectation

An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDPs): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. This function plots the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
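
The per-instance curves behind an ICE plot, sketched with scikit-learn rather than H2O:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# kind="individual" keeps one curve per row instead of averaging them.
result = partial_dependence(model, X, features=[0], kind="individual",
                            grid_resolution=15)
curves = result["individual"][0]  # shape: (n_rows, n_grid_points)
```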

In [4]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import data
f = "/content/drive/MyDrive/Qualities in Intelligent Systems/Bike-Sharing-Dataset/hour.csv"
df = h2o.import_file(f)

# Response column
y = "cnt"

# Split into train & test
splits = df.split_frame(ratios=[0.8], seed=1)
train = splits[0]
test = splits[1]

# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)

# Explain leader model & compare with all AutoML models
exa = aml.explain(test)

# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)
Checking whether there is an H2O instance running at http://localhost:54321. connected.
H2O_cluster_uptime: 6 mins 26 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.40.0.2
H2O_cluster_version_age: 8 days
H2O_cluster_name: H2O_from_python_unknownUser_kavmgw
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 2.965 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.9.16 final
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%

Leaderboard

The leaderboard shows models with their metrics. When provided with an H2OAutoML object, it shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the supplied frame. At most 20 models are shown by default.
model_id                                                   rmse       mse       mae       rmsle       mean_residual_deviance    training_time_ms    predict_time_per_row_ms    algo
StackedEnsemble_AllModels_1_AutoML_2_20230317_151751       2.11301    4.46482   1.68988   nan         4.46482                   474                 0.013401                   StackedEnsemble
StackedEnsemble_BestOfFamily_2_AutoML_2_20230317_151751    2.11301    4.46482   1.68988   nan         4.46482                   449                 0.007542                   StackedEnsemble
StackedEnsemble_BestOfFamily_1_AutoML_2_20230317_151751    2.11685    4.48106   1.69214   nan         4.48106                   701                 0.00787                    StackedEnsemble
GLM_1_AutoML_2_20230317_151751                             3.08055    9.48976   2.1927    nan         9.48976                   289                 0.00121                    GLM
XGBoost_1_AutoML_2_20230317_151751                         7.14944    51.1145   4.704     0.0928143   51.1145                   2942                0.004964                   XGBoost
GBM_1_AutoML_2_20230317_151751                             12.3146    151.649   7.45037   0.149081    151.649                   3393                0.01313                    GBM
DRF_1_AutoML_2_20230317_151751                             21.6649    469.367   10.9212   0.147284    469.367                   441                 0.001548                   DRF
GBM_3_AutoML_2_20230317_151751                             96.1387    9242.64   74.8388   1.19714     9242.64                   300                 0.002984                   GBM
GBM_4_AutoML_2_20230317_151751                             117.481    13801.8   91.5657   1.30082     13801.8                   166                 0.00264                    GBM
GBM_2_AutoML_2_20230317_151751                             128.581    16533     100.298   1.35206     16533                     154                 0.001923                   GBM
XGBoost_3_AutoML_2_20230317_151751                         180.168    32460.7   130.209   1.13754     32460.7                   53                  0.000492                   XGBoost
XGBoost_2_AutoML_2_20230317_151751                         180.685    32647.2   130.09    1.12879     32647.2                   593                 0.000833                   XGBoost
[12 rows x 9 columns]

Residual Analysis

Residual analysis plots the fitted values versus the residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, or not accounting for heteroscedasticity, autocorrelation, etc. Note that "striped" lines of residuals are an artifact of an integer-valued (rather than real-valued) response variable.

Learning Curve Plot

The learning curve plot shows the loss function or metric as a function of the number of iterations (or, for tree-based algorithms, the number of trees). This plot can be useful for determining whether the model overfits.

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.

Variable Importance Heatmap

The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, we compute a summary of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.

Model Correlation

This plot shows the correlation between the predictions of the models. For classification, the frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit, are highlighted using red-colored text.

SHAP Summary

The SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term equals the raw prediction of the model, i.e., the prediction before the inverse link function is applied.

Partial Dependence Plots

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. The PDP assumes independence between the feature for which it is computed and the rest.

Individual Conditional Expectation

An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDPs): a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single instance. This function plots the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.

In [5]:
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
from sklearn import metrics

# Import data
df = pd.read_csv("/content/drive/MyDrive/Qualities in Intelligent Systems/Bike-Sharing-Dataset/day.csv")

X = df.drop("cnt", axis=1)
y = df.pop("cnt")

# NOTE: make_regression replaces the bike-sharing data loaded above with a
# synthetic regression problem, so the score below is computed on synthetic
# data rather than on day.csv.
X, y = make_regression(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = GradientBoostingRegressor(random_state=0)
reg.fit(X_train, y_train)
reg.predict(X_test[1:2])
reg.score(X_test, y_test)
Out[5]:
0.4403245677708285
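
To fit the model on the bike-sharing data itself rather than the synthetic `make_regression` problem, the non-numeric date column and the leakage columns `casual` and `registered` (whose sum is exactly `cnt`) would need to be dropped first. A minimal sketch, using a tiny invented frame in place of day.csv:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Tiny stand-in for day.csv (hypothetical values; the real file has 731 rows).
df = pd.DataFrame({
    "dteday": pd.date_range("2011-01-01", periods=60).astype(str),
    "temp": [i / 60 for i in range(60)],
    "hum": [0.5 + (i % 7) / 20 for i in range(60)],
    "casual": [100 + 5 * i for i in range(60)],
    "registered": [900 + 20 * i for i in range(60)],
})
df["cnt"] = df["casual"] + df["registered"]

# Drop the date string and the two components of cnt to avoid target leakage.
X = df.drop(columns=["dteday", "casual", "registered", "cnt"])
y = df["cnt"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
score = reg.score(X_test, y_test)
```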

To understand how various environmental conditions affect the number of bikes rented, we can create different plots to visualize the relationship between the input features and the target variable (i.e., the number of bikes rented).

  1. Scatter Plots: A scatter plot can be used to visualize the relationship between two continuous variables, such as temperature and the number of bikes rented, with temperature on the x-axis and the number of bikes rented on the y-axis. If the two are positively correlated, the points will trend upward from the lower left to the upper right of the plot.
  2. Line Charts: A line chart can be used to visualize trends in time-dependent variables, such as the day of the week or the month. We can plot the average number of bikes rented on the y-axis and time on the x-axis. This can reveal patterns in the data, such as more bikes being rented on weekends than on weekdays.
  3. Bar Charts: A bar chart can be used to visualize the relationship between a categorical variable and a continuous variable, such as weather conditions and the number of bikes rented. We can plot the average number of bikes rented on the y-axis and the weather condition on the x-axis. This shows whether rental volumes differ under different weather conditions.
  4. Heatmaps: A heatmap can be used to visualize how a continuous variable varies with two categorical variables, such as month and weather conditions. We can plot month on the x-axis, weather condition on the y-axis, and use a color scale to represent the average number of bikes rented in each cell. This can reveal seasonal patterns, such as more bikes being rented in the summer than in the winter.

By using these different types of plots, we can gain insight into how various environmental conditions affect the number of bikes rented, and use that insight to optimize bike rental services based on the current weather, time of day, or season.
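
The aggregation behind the bar chart in point 3 can be sketched with pandas; a tiny invented frame stands in for day.csv (the real file does have `weathersit` and `cnt` columns, but these rental counts are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "weathersit": [1, 1, 2, 2, 3],
    "cnt": [5000, 4600, 3200, 3000, 1200],
})

# Average rentals per weather condition: the heights of the bars.
avg_by_weather = df.groupby("weathersit")["cnt"].mean()
# avg_by_weather.plot(kind="bar") would render the chart.
```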